The function \(\hat{f}\) in #2.2.1 is also called the Perceptron and dates back to Warren McCulloch and Walter Pitts in 1943. Due to the discontinuity of \(\hat{f}\), gradient-based methods for determining \(w\) and \(b\) via a least squares fit are not applicable. An alternative method was proposed by Frank Rosenblatt in 1958. For this purpose, the perceptron loss is defined as
\[\mathcal{L}(w, b) := \frac{1}{N} \sum_{i=1}^{N} \text{max}(0, -y_i (\langle w, x_i \rangle + b))\]
Note that in the case where \(\langle w, x_i \rangle + b\) and \(y_i\) have the same sign, the \(i\)-th term in this sum is equal to zero. If the signs differ, the corresponding summand is equal to \(|\langle w, x_i \rangle + b|\) and thus proportional to the distance from \(x_i\) to the hyperplane.
If \(y_i\) is negative (e.g. \(-1\)) and \(\langle w, x_i \rangle + b\) is also negative (e.g. \(-2\)), then \[-y_i (\langle w, x_i \rangle + b) = -(-1) \cdot (-2) = -2\] and thus the max is \(0\)! Now if the signs are different, e.g. \(y_i\) is \(1\) and \(\langle w, x_i \rangle + b\) is still \(-2\), then \[-y_i (\langle w, x_i \rangle + b) = -(1) \cdot (-2) = 2\] (which is equal to \(|\langle w, x_i \rangle + b|\) and thus proportional to the distance from \(x_i\) to the hyperplane).
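As a quick sanity check, here is a minimal NumPy sketch of the loss together with the two numbers from the example above (the function and variable names are my own, not from the text):

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    """Mean perceptron loss: (1/N) * sum_i max(0, -y_i * (<w, x_i> + b))."""
    scores = X @ w + b                        # shape (N,)
    return np.mean(np.maximum(0.0, -y * scores))

# The two cases from the example above, with <w, x_i> + b = -2:
print(max(0.0, -(-1) * (-2)))   # same sign:      max(0, -2) = 0.0
print(max(0.0, -(+1) * (-2)))   # different sign: max(0,  2) = 2
```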
Therefore, misclassified points \(x_i\) that are further away from the hyperplane are penalized more heavily than those that are close. We note that #2.2.3 is a convex function in the sought variables \(w\) and \(b\). Unfortunately, it is not differentiable in \(w\) and \(b\) at \(y_i(\langle w, x_i \rangle + b) = 0\). However, we still calculate the derivative of the function \[f_i(w, b) := \text{max}(0, -y_i (\langle w, x_i \rangle + b))\] with respect to the two variables \(w \in \mathbb{R}^d\) and \(b \in \mathbb{R}\) and ignore the fact that the maximum function is not differentiable there. At such points, we set the derivative to the value \(0\). The resulting derivative is called a subgradient of \(f_i\), and is given by \[\partial_w f_i(w, b) := -y_i x_i 1_{y_i(\langle w, x_i \rangle + b) < 0}\] \[\partial_b f_i(w, b) := -y_i 1_{y_i(\langle w, x_i \rangle + b) < 0}\] How are these computed?
The loss has two cases, depending on whether the point is correctly classified or misclassified.

Case 1: correctly classified, i.e. \(y_i (\langle w, x_i \rangle + b) \geq 0\)
- The term inside \(\text{max}(0, -y_i(\langle w, x_i \rangle + b))\) is \(\leq 0\) (because of the negative sign in front of the \(y_i\)), so \(f_i(w, b) = 0\).
- Gradients: since the loss is a constant (\(0\)), the gradients are \[\partial_w f_i(w, b) = 0, \hspace{0.3cm} \partial_b f_i(w, b) = 0\]

Case 2: misclassified, i.e. \(y_i(\langle w, x_i \rangle + b) < 0\)
- The term inside the max is \(> 0\), so \(f_i(w, b) = -y_i(\langle w, x_i \rangle + b)\).
- Gradients: differentiating this linear function gives \[\partial_w f_i(w, b) = -y_i x_i, \hspace{0.3cm} \partial_b f_i(w, b) = -y_i\]

Combining both cases with the indicator \(1_{y_i(\langle w, x_i \rangle + b) < 0}\) yields the subgradient formulas above; see the sketch below.
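A minimal sketch of the per-sample subgradient, written with the indicator so that both cases are covered in one expression (names are illustrative, not from the text):

```python
import numpy as np

def subgradient_i(w, b, x_i, y_i):
    """Subgradient of f_i(w, b) = max(0, -y_i * (<w, x_i> + b)).

    Case 1 (y_i * (<w, x_i> + b) >= 0): both components are zero.
    Case 2 (y_i * (<w, x_i> + b) <  0): (-y_i * x_i, -y_i).
    """
    indicator = 1.0 if y_i * (np.dot(w, x_i) + b) < 0 else 0.0
    return -y_i * x_i * indicator, -y_i * indicator
```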
Rosenblatt’s algorithm is now a stochastic subgradient method with step size \(\tau > 0\) and batch size \(B \in \{1,...,N\}\) applied to \(\mathcal{L}(w, b)\):
\[\begin{cases} \text{Initialize } w_0 \in \mathbb{R}^d, b_0 \in \mathbb{R} \\ \text{For } k \in \mathbb{N} \text{ iterate:} \\ \text{Choose a random } B\text{-element subset } \mathcal{I} \subset \{1,...,N\}, \\ \text{Define } \mathcal{M} := \{i \in \mathcal{I} : y_i(\langle w_{k-1}, x_i \rangle + b_{k-1}) < 0\} \text{ (misclassified points)}, \\ w_k := w_{k-1} + \frac{\tau}{B}\sum_{i\in \mathcal{M}} y_i x_i, \\ b_k := b_{k-1} + \frac{\tau}{B}\sum_{i \in \mathcal{M}} y_i \end{cases}\] The main problem with Rosenblatt’s algorithm is that we have no control over which hyperplane it converges to. In the case that the data are linearly separable, i.e. there exist \(w \in \mathbb{R}^d\) and \(b \in \mathbb{R}\) such that \(y_i(\langle w, x_i \rangle + b) \geq 0\) for all \(i = 1,...,N\), every hyperplane of this form is a fixed point of the algorithm (the algorithm stops updating once all points are correctly classified!).
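A minimal NumPy sketch of this update rule (the function name, default values, and stopping after a fixed number of iterations are my own choices, not from the text):

```python
import numpy as np

def rosenblatt(X, y, tau=1.0, batch_size=10, n_iter=1000, seed=0):
    """Stochastic subgradient method for the perceptron loss.

    X: (N, d) data matrix, y: labels in {-1, +1},
    tau: step size, batch_size: B in the text.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        # Choose a random B-element subset I of {1, ..., N}.
        I = rng.choice(N, size=batch_size, replace=False)
        # M: misclassified points within the batch.
        M = I[y[I] * (X[I] @ w + b) < 0]
        if M.size == 0:
            continue  # all batch points correctly classified: no update
        w = w + (tau / batch_size) * (y[M][:, None] * X[M]).sum(axis=0)
        b = b + (tau / batch_size) * y[M].sum()
    return w, b
```

Note that once the data are separated by the current hyperplane, \(\mathcal{M}\) is always empty and the iterates stop changing, which is exactly the fixed-point behaviour described above.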
Next up: 2.2.1.2 Support Vector Machines